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Abstract. We develop a maximum penalized quasi-likelihood estimator for estimating in a non- 
parametric way the diffusion function of a diffusion process, as an alternative to more traditional 
kernel-based estimators. After developing a numerical scheme for computing the maximizer of the 
penalized maximum quasi-likelihood function, we study the asymptotic properties of our estima- 
tor by way of simulation. Under the assumption that overnight London Interbank Offered Rates 
(LIBOR); the USD/EUR, USD/GBP, JPY/USD, and EUR/USD nominal exchange rates; and 
1-month, 3-month, and 30-year Treasury bond yields are generated by diffusion processes, we use 
our numerical scheme to estimate the diffusion function. 



0. Introduction 

One of the key achievements in the field of financial engineering is the representation of the 
price of contingent claims as the expectation of their discounted future payoffs under the so-called 
risk-neutral probability, i.e., a probability measure under which discounted (by the exponential of 
integrated short rate) traded asset prices are martingales. In complete markets, this representation 
permits the computation of the unique arbitrage- free price, as well as the hedging strategy needed 
to remove all the risk associated with issuing (or writing) the contingent claim. 

From a practical viewpoint, there is a major issue with choosing the form of the risk- neutral 
probability measure, as it depends on the model specification. In other words, the stochastic 
movement of asset prices must be modeled in a way that is consistent with observed data collected 
from the market. As equivalent changes of probability leave the quadratic variation of a process 
intact, market data provide a possible way to pin down the volatility component; then, the drift 
rate under the risk-neutral measure is simply set equal to the applicable short rate. 

The most elementary continuous-time model for asset prices is probably geometric Brownian 
motion, for which the log-price dynamics are characterized by both constant drift rate and constant 
volatility. This particular choice of parameters is not consistent with various asset prices like stock 
prices, exchange rates, or interest rates. Therefore, more elaborate models have to be utilized. One 
direct generalization of geometric Brownian motion are local volatility models, where the diffusion 
coefficient is a function of the underlying asset's level. 

In this paper, we consider the class of diffusion models with coefficient functions that are allowed 
to depend on the asset level. We aim to develop a nonparametric estimation procedure for the 
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diffusion function based on a maximum penalized quasi- likelihood method, following the work of 

In Section [TJ we consider a one-dimensional diffusion model for the movement in the price of a 
financial asset, and briefly review the history of attempts to estimate the diffusion function a. In 
Section [2j we develop a quasi- likelihood function for a diffusion process and then add a penalization 
term (also known as a regularization term) to obtain a penalized quasi-likelihood function. After 
establishing the existence of a maximizer 9* of the penalized quasi-likelihood function, we use 
techniques from the calculus of variations to justify a property of 0* which, in turn, permits us 
to introduce a numerical scheme for calculating 0* along discrete, non-uniform design points. In 
Section [3j we study two simulated diffusion processes and note that the mean integrated squared 
error of our estimator converges to zero at rates that seem to be comparable to the rates of 
convergence achieved by kernel-based estimators. Then, in Section [4j we use software to estimate 
the diffusion function for overnight London Interbank Offered Rates (LIBOR), 1-month and 30- year 
Treasury bond yields, and the USD/EUR, USD/GBP, and JPY/USD exchange rates. 

1. Discussion of the Problem 

1.1. The market model. We consider a diffusion model for the price movement of a financial 
asset. In particular, we study a one-dimensional diffusion (it)te[o,Tl with Yq = y and dynamics 
given by 

(1.1) dY t = b(Y t )dt + a(Y t )dW t , t € [0,T\. 

Here, W is a standard Brownian motion, b : R H > R and a : R i— > M ++ = (0, oo) are Borel- 
measurable functions, and T > is a fixed time horizon. With C([0, T];R) denoting the canonical 
path-space of continuous functions equipped with the Borel sigma field, the dynamics described in 



fll.l|) are valid under the probability Pf ,<T) on C([0,T];M), which is such that Ff a) [Y a = y] = 1 



In order for the problem to be well-posed, we assume that the stochastic differential equation ( 1.1 ) 



has a weak solution that is unique in the sense of probability law. For conditions that guarantee 



the existence of a weak solution to (1.1), see [23, Theorem 5.4, pg. 332]. Of course, uniqueness of 
a weak solution is a necessary condition for the well-posedness of statistical estimation problems 
involving diffusions. 

Remark 1.1. The assumption that the functions b and a have domain R is made because in the 
analysis we shall use a conditional Gaussian approximation for the transition densities of the 
diffusion Y . In practice, the diffusion can live on any sub-interval of R, such as (0, oo); this situation 
will create only theoretical obstacles. Ultimately, it is usually the case that a scale transformation 
of Y will result in a related diffusion with full support on R; for example, a log-transformation of 
a diffusion supported on (0, oo) results in a new diffusion supported on R. 

1.2. Nonparametric estimation of the diffusion function. We consider the problem of esti- 



mating in a nonparametric way the coefficient a that appears in equation (1.1) under the assump 



tion that P^'"^ is the historical (or statistical, or real-world) probability law. The estimation of a 
is of crucial importance, since it stays fixed under the equivalent changes of probability measure 
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that are necessary for pricing and hedging contingent claims. More specifically, recall that when 
we pass from the historical probability law to the risk-neutral probability law, a remains unaltered 
while b changes. 

In the case in which we continuously observe data over the time interval [0, T], i.e., when the 
whole path (it)t e [o,T] is observed, perfect estimation of a is possible, at least in the window of 
observations [m^,Mr], where we define 

(1.2) rriT '■= min Yt and Mj- := max If. 

te[o,T] te[o,T] 

Indeed, the quadratic variation of Y under p( b >°") is (Y,Y) = J a 2 (Y t )dt and hence o~ 2 (Y t ) = 
d(Y,Y)t/dt for t £ [0, T]. In this case of continuous observations, given perfect estimation of 
a, [21] contains substantial results regarding estimation of the drift b in both parametric and 
nonparametric settings. 

Of course, the case of continuous observations is only a theoretical idealization. In practice, we 
are usually presented with discrete observations Y to ,Y tl , . . . , Y tn , where = to < . . . < t n = T. 
Typically, researchers study two distinct problems for the case in which we only have discrete 
observations at our disposal. In the case of low sampling frequency, the sampling interval remains 
fixed, and the behavior of estimators is studied as T tends to infinity. (In [2], this is referred to 
as the long-time asymptotic approach — see, for example, |14| for the nonparametric approach in 
this case.) We shall consider the problem of estimating the diffusion function a when we have high 
sampling frequency, that is, the mesh of the partition, as defined by maxi=i n \ U — U-i\, tends to 
zero but the time horizon T remains fixed. See [2] for more on nonparametric estimation of the 
drift and the diffusion function in the case of high-frequency data, which is referred to as the infill 
asymptotic approach. The infill asymptotic approach has recently been combined with observation 
of the integrated diffusion process to estimate the drift and diffusion functions — see [3] . 

The earliest research on the problem of nonparametric estimation of the diffusion function a : 
R h- > (0, oo) seems to have been undertaken by [12J. In [20], kernel-smoothing estimation of a 
is utilized in a manner motivated by kernel estimation of the density functions, and a rate of 
convergence of the estimator of order n - m /( 2m + 1 ) [ s obtained for the case in which a is m times 
continuously differentiable. A similar approach is undertaken in |3], [21] and [25]. In [19], an L p -loss 
estimator for a is developed which, under suitable conditions, has a minimax rate of convergence 
that is also of order n - m /( 2m + 1 ). The same author constructs estimators for the coefficients of a 
diffusion process based on adaptive wavelet thresholding (with respect to an unknown degree of 
smoothness in the coefficient functions) in [T7] . Wavelet methods were also used in |13] to estimate 
the drift and diffusion functions and again, the rate of convergence was found to match the classical 
rate of convergence for kernel-based estimators. A kernel smoothing approach is used in |10j to 
define spatial and temporal estimators for a. This technique was later adapted by |16] to test 
diffusions for stationarity. 

In this paper, we propose instead to use a method based on the maximum penalized quasi- 
likelihood. We provide estimators of a that empirically behave at least as well as the other esti- 
mators developed in the literature up to the present time. We note additionally the connection 
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between the method that we will propose and kernel estimators via reproducing kernel Hilbert 
spaces, as is presented, for example, in [8j pp. 20-26]. 

2. Estimation Using Maximum Penalized Quasi-Likelihood 

2.1. Zero drift. The whole analysis below will be carried out assuming that 6 = 0. Analysis for 
other values for the drift coefficient could be similarly carried out, albeit in a more complicated 
way. However, this would not eventually serve any purpose in the case of high-frequency data. In 
fact, any analysis regarding consistency and rates of convergence that applies to the case of zero 
drift applies also immediately to the case of non-zero drift. The reason is that, in view of Girsanov's 
theorem, under very mild integrability assumptions on b/cr, the probabilities P^' ") and P^ ' ") are 
equivalent. In fact, in Section [3] we shall present a simulation study that explores the efficiency of 
our proposed diffusion coefficient estimator in the case of a diffusion with non-zero drift. 

2.2. Quasi-likelihood. Under the assumption that b = and that our observations are of high fre- 
quency, for each i = 1, . . . , n, the conditional law of given (Yj , Y^ , . . . , is approximately^] 
normal with mean Y ti _ 1 and standard deviation (r(Y ti l )y/ti — U-l- Use of this approximation 
permits us to construct the weighted quasi-log-likelihood of the sample, which is defined by 

(2.1) „«(,, I r„,...,Y*) -1±{ -mfov,)) - \[ ^-%1 t J }- 

(The symbol "qll" for the above function stands for "quasi-log-likelihood.") 

Of course, any estimator a that satisfies S : (lt i _ 1 ) = |i?t i _ 1 | at the values in the observation set 
O:={Y t0 ,Y tl ,...,Y tn _ 1 }, where 

(2.2) Rtii ,= ^^I^, i = l,..., n , 

V H — H—l 

would correspond to a maximum likelihood estimator with perfect fit to the data. (In particular, 
such an estimator is not unique.) As is typical in infinite-dimensional estimation problems (an- 
other example of which is nonparametric density estimation), naive interpolations of the points 
{(Yt i _ 1 ,\Rt i _ 1 \)}i=i,...,n result in estimators that oscillate wildly and are nonsensical. In order to 
effectively resolve this issue, one needs to impose some condition on the estimators. Nonparamet- 
ric estimation procedures frequently call for the function being estimated to possess some degree 
of differentiability, which restores the well-posedness of the optimization problem. In this paper, 
we undertake such an approach, penalizing lack of smoothness in estimates of a via a maximum 
penalized quasi-likelihood method. 

2.3. Maximum penalized quasi-likelihood estimator. Since a > 0, the transformation 

0:=-log(a) 



^Obviously, the quality of the approximation will improve as the mesh tends to zero. 
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is well-defined. The weighted quasi-likelihood is now 

i n 1 

(2-3) qll (fl | Y t0 , . . . ,Y tn ) = l£ {'CVJ " ^O"^}' 

i=l 

For the numerical schemes that we will be developing, it is useful to rewrite the above expression 
in terms of the order statistics of the points in D. For j = 1, . . . , n, let pj denote the name of the 
rank j point in D, with the smallest number having rank one. Selj^] 

(2.4) yj := Y pj and r,j := R py 

For future reference, we set yo := — oo and y n +i '■= +oo; note that these two points do not belong 
in our observations set D. With this new notation, we rewrite the weighted quasi-likelihood as 

where we drop the dependence of qll on the sample. As mentioned before, we incorporate a term 
that will penalize estimates for lack of smoothness to produce the penalized quasi-log-likelihood 

(2.5) pqll(0;m,A) = qll(0)-£ / \8^(z)\ 2 dz, 

where A > is a penalization factor, m E N, and 8^ is the derivative of order m of 8. We shall 
call 8* a maximum penalized quasi-likelihood estimator if it is a solution to the problem 

arg max pqll(0;m, A), 

~-"(m,2) 



where the maximization is over the space 

W (m,2) . = jj . M — yR j(m-i) exigtS; ig absolutely continuous, and / \f^ m \z)\ 2 dz < oo). 

(Note that w' m ' 2 ) as defined above is not the usual Sobolev space, which would be a particular 
subset of L 2 (M) — functions in W^ m ' 2 ^ might fail to be square-integrable.) 

It can be shown that the maximizer 8* of functionals like W^ m ' 2 ^ 3 8 \- V pqll (8; m, A) is a 
natural spline of order 2m — 1 with knots O = {yi, . . . ,y n } — see [9jj^] It follows that is at 
least 2m — 1 times differentiate, as well as piecewise polynomial of order 2m — 1 on all intervals 
{{Vki Uk+i)}k=o,...,n- (Recall that we are using the conventions yo = — oo and y n +i = +oo.) In 
particular, 6>i 2m ^ is piecewise constant on the intervals {{yk, Uk+i)}k=o,...,n- The fact that 8* is 
a natural spline means that all derivatives of order m, . . . ,2m — 1 vanish outside [yi,y n ]. Since 
8* is at least 2m — 2 times continuously differentiable, this implies that 8*\y\) = = 8*\y n ) 
for i = m, ... ,2m — 2; furthermore, di 2m (yi— ) = = 8^ m 1 \y n +), where 8^ m (y— ) and 
&i 2m ^(y+) will denote the left-hand and right-hand limit, respectively, of 8^ m ^ at y E M. 



2 We shall be assuming that there are no ties in the ranking. For an example of how we handle ties in the rankings, 



see Section Note, however, that in theory ties occur with probability zero 

^The solution 8 t is an example of an M-type estimator — see, for example, [5]. For more information about the 
general problem of fitting splines with restricted sets of values using penalty functions, see [26 
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Remark 2.1. By definition, the estimators have a number of derivatives equal to zero at the 
points 2/1 and y n . Of course, the "true" 9 = — log(cr) is not expected to have such behavior 
near the endpoints. The method of penalization over-smooths the estimator close to the extreme 
observations y\ and y n — this situation is unavoidable, since there are very few observations near 
these points. It would be an interesting topic for future research to investigate in a rigorous way 
the severity of this effect close to the extreme observations. 

In all that follows, we shall be freely using the fact that 9* has the above special structure. 
As 9^ m ^ is not uniquely defined at the points O = {yi, . . . , y n }, we agree to pick 9* such that 
^(2m l) . g r ig]^_ con ti nuouS) i_ e ; we enforce 9^ m l \yi) = 9 < f m l \yi+) to hold for all i = 1, . . . , n. 



Lemma 2.2. Let 9* be the maximizer of equation (2.5). Then, for any 5 G W*- m ' 2 \ 

1 - r 
(2.6) ^£{%,)(l-^ 2M%) )} = ^ 1 ) m_lA / ^ m - l \z)5'{z)dz. 



3=1 



Proof. To see this fact, assume that 9* is the maximizer of equation (2.5). For any 5 S W^ m ' 2 \ 
d 



0= — pqll(6»* + e5;m,X) 
oe 



_|( q , ( «. +rf) _* 



e=0 



By (2.3), we obtain 



0_ 

We 



1 n 

q|| +e g) = - V Ufa) - 5{y 3 )r 2 e 26 ^A. 



Furthermore, it is straightforward that 

9 (- f \9i m \z) + e5^(z)\ 2 dz) = A I 9 { r\z)5 i - m) (z)dz. 



de\2 



e=0 



Therefore, the first-order conditions for optimality become 



i n r 

~ E {<%) " S(yj)r]e 26 *^} = A / 9i m \z)5^(z)dz. 

n =1 1 J Jr 



Since 0*"^ vanishes outside [y± , y n ] , integration- by-parts implies that 

A [ 9 { r\z)5 {m) {z)dz = -\ I 9i m+l \z)5^ 1 \z)dz. 
Jr Jr 

Now, repeatedly integrating by parts and using the fact the 0* vanishes outside [yi,y n ] for i 
m, . . . , 2m — 1 yields the result. 



□ 



By considering appropriate functions 5 and substituting them into equation (2.6), we obtain the 
following result, which will be the basis for our algorithm to compute 9*. 

Proposition 2.3. The maximum penalized quasi-likelihood estimator 9* is such that 

k 



(2.7) 



> r2n ' -Mn { ';"'(;• Y^ry 2 ' 



nX 



holds for all y G [yk, yk+i), where k = 1,.. . ,n. (Recall that y n +\ = +oo, by convention.) 
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Proof. Fix y £ {yk->yk+i) for some k £ {!,..., n} and consider the function S y = I(_oo.j/)- I n 



that case, — b' y is a Dirac mass at y. Of course, equation (2.6) cannot be applied directly to a 
non-differentiable function like S y . To circumvent this issue, we use an approximating procedure. 
Consider the sequence of functions K 3 z i— )■ 6 y n( z ) = ${N(y — z)) for N £ N, where <3? is the 
cumulative distribution function of the standard normal distribution. Note that 5 Vi n £ W^" 1 ' 2 ) for 



m > 1 and that (f^jvJiVeN converges pointwise to 5 y at all points except for y. By equation (2.6) 
and typical arguments utilizing the distributional convergence of K 3 z i— >■ — {d / dz)8 y ^{z) to the 
Dirac mass at y, the equation 

(l-^ 2M%) )} = (-ir-'nX j^ m - l \z){^- z 8 y , N {z)}dz 



is valid for all N £ N. It also leads to equation ( |2.7[ ) upon sending to infinity as soon as 
one notices that lirnjv-->oo Y^=i ^y,N(Uj) = k for y £ {yk,Vk+i) and that ^ 2m ^ is constant on 
{VkiVk+l)- Recalling that we are considering the right-continuous version of oi 2m , we conclude 



that equation (2.7) is true for all y £ [yk,Uk+i)- ^ 



2.4. Computing the estimator. Assume that values for the penalization parameters A and m 



are prespecified. We will discuss possible ways of choosing A and m in subsection 2.5 We know 



Y to ,Y tl ,..., Y tn and hence the pairs {yi,n), (y 2 , r 2 ), . . . , (y n , r n ) given by equation (2.4). We want 



to determine 9i (yj) for i = 0, . . . , 2m — 1 and j = 1, . . . , n. These values of 9* and its higher-order 
derivatives along the observation set D = {yi, . . . ,y n } can be used to fit a spline, which then 
generates an estimate of 9 (and, therefore, of a = exp(— 9) as well) on the interval [yi,y n ]- The 



basic idea of the algorithm is to use equation (2.7) to obtain #( 2m at a particular yj and then 



to "work downwards" to the lower-order derivatives. 

We introduce further notation and definitions to support for the description of the iterative 



procedure induced by equation (2.7). For a = (a^°\ . . . , a( m_1 )), define ©(•; a) to be the spline of 
order 2m — 1 with knots D, with the properies that 0^(yi;a) = a® for i = 0, . . . ,m — 1, 0^ 
vanishes on (— oo, yi) for i = m, . . . , 2m — 1, and 

(2.8) & 2m ~ l Xy-a) = ^j- (k - £ rfe 20 ^) 

for all y £ [y^, y&+i), fe = 1, . . . , n, where again we are considering the right-continuous version of 
q(2to-1)^.. a ^ anc j ge ^- y n+1 = oo by convention. Note that the above properties characterize the 
spline 0(-;a) entirely in terms of a recursive procedure; we shall discuss how to compute all the 
values of 0(-;a) given a = (a^°\ . . . ,a^ m ~ 1 ^) £ IR m in the sequel. For the time being, and as a 



warm-up for the algorithm that will be presented below, note that equation (2.8) implies that 

(2.9) e^~ l \y k ; a) = O^W; a) + <=±£ (l - r 2 e 20 ^)) , k = 1, . . . , n, 

upon agreeing that O^ 2m_1 ^(yo; a) = as a matter of convention. 

In view of Proposition |2.3[ and since the estimator 0* must be a natural spline, it will hold that 
#*(•) = ©(•; a*) where a* £ M m is such that 9^(y n ; a*) = for i = m, . . . , 2m — 1. Therefore, our 
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goal is to obtain the root of the nonlinear equation F(a) = 0, where the mapping F : M m — > M. m 
is defined by 

F(a):=(e«(y n ;a)) 

V / i=m,...,2m—l 

In order to do this, an efficient way of computing F(a) for a given a = (o(°),...a( m - 1 )) G R m is 
required. The following pseudocode illustrates the computation of F(a) in the case m = 2, which 
is the value we mostly use for the numerical computations. It is straightforward to adapt the code 
for any value of m G N. 

Inputa = (a(°), a«) = (e^fa; a), 9«(ift;a)). 
Set 9( 2 )(yi;a) = 0, 6( 3 )(y ; a) = 0. 
For fe = 1, . . . , n — 1, set : 

e (2 Wi;«) = e( 2 )(y fc ;a) + 6( 3 )(y fc ;a)(y fc+1 

e«(y fc+1 ;a) = Q«(y fc ; a) + G^fe a)(y fc+1 - y fc ) + (1/2)9® a)(y fc+1 - y k ) 2 , 

e (0 Wi;o) = 0(o) (yfc; «) + 0(1) (yfc; - wO + (i/2)e( 2 )( yfc; a)( yfe+ i - yk f 

+ (l/Q)e^(y k ;a)(y k+1 -y k ) 3 . 

Next k 

Set G( 3 )(y n ;a) = e( 3 )(y„_i;a) + (nX)~ x (l - r£e e(0) (^ a )) 
Return F(a) = (G( 2 )(y n ; a), G( 3 )(y n ; a)). 

Note that the philosophy of our algorithm is a variant of the so-called "shooting method" [H 
page 177]. When close to the root a*, Newton's method can be utilized to get a better "aim," which 
will result in fast convergence of this iterative scheme. (Newton's method has to be used with care 
— if far away from the root, the method is not likely to work and the numerical scheme will fail to 
converge. For further comments on this problem, see subsection 2.6 ) Newton's method requires 
computation of the partial derivatives (d/da^)F, which can be easily computed along with F(a). 
Indeed, note that <pi(-; a) := (d/da^)Q^\-; a) for i = 0, . . . , m — 1 and j = 0, . . . , 2m — 1 satisfy 




Furthermore, upon differentiating (2.8), we obtain 

_ gggr%) _ ( -i r+ i ^ (0)( a) (0) 

i=i 

for all y G [y k ,y k +i), where k = 1, . . . ,<n. Therefore, computing these derivatives can be performed 
for little extra cost in the same iterative procedure used to compute @(yj;a) for j = l,...,n. 
Indeed, one has to simply differentiate the iterative equations with respect to a and obtain iterative 
equations for the derivatives As soon as the information about the m partial first-order derivatives 
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of F at a particular a 6 M m is obtained, we can implement Newton's method for finding the root 
of the equation F{a) = 0. Newton's method proceeds by successive approximation. Consider an 
initial guess a± = (ct^ )i=o,...,m-l an d inputs 5 > and e > 0, where 5 controls the step size in 
Newton's method and e determines when Newton's method is terminated^] Then, for k = 1, . . ., 
while |-F(afc)| > e (where |-| denotes the usual Euclidean norm), one sets a^+i = a&— 5&~ 1 (ak)F(ak)- 
In the previous formula, <& is the m x m matrix with entry ip^\y n ;a) in the i th row and the j th 
column for % = 0, . . . , m — 1 and j = 0, . . . , m — 1. (Note that we are numbering rows and columns 
from to m — 1, in order to be consistent with our notation.) 



Remark 2.4. There is accompanying software that implements the above algorithm, together with 
instructions regarding its use, available upon request from the authors. 

2.5. Choosing the penalization parameters. In the literature on nonparametric regression 
using the penalized likelihood method (also called the regularization method), the most typical 
choice for the order of differentiation to penalize is m = 2 — see, for example, [6] or [15]. The 
choice m = 2 gives rise to estimators that are natural cubic splines and are visually very attractive. 
The use of m = 1 results in estimated functions that are quite "wiggly" — see for example, Figure 
[TJ Use of m > 3 is computationally involved; for this reason, we refrain from such practice. 

The choice of the penalization coefficient A is more subtle. One can either use cross-validation 
techniques — see, for example, JjJJ or |15j . However, it is often that case that simple visual 
inspection of the graphs is sufficient. 

Theoretical results from the theory of nonparametric regression using the penalized likelihood 
method provide a hint in understanding how the penalization coefficient A should decrease with 
increasing sample size n when the degree of smoothness (as given by m) of the target function 
is fixed. More precisely, for fixed m, in it is conjectured that the choice A n ~ n - 2m /( 2m+1 ) as 
n — > oo will result in convergence of the estimator to the true value of the order n - m /( 2m + 1 ). p or 
theoretical background, see [6]. In the next section, we shall provide empirical results on rates of 
convergence for m = 1 and m = 2 that seem to support this conjecture. 

2.6. Practical remarks regarding convergence of our algorithm. If one uses the proposed 
penalization A n ~ n~ 2m /( 2m + 1 ) for a given m £ N as n — > oo, the numerical scheme suggested in 



Subsection 2.4 tends to be unstable for large sample sizes. In order to get a feeling for the reason, 



note that nX n ~ n i/(2m+i) ccm verges to infinity when n — > oo, albeit slower than n. Since one 



has to repeat the iteration given by (2.9) n times to obtain F(a), it is reasonable to suspect (and 
it actually happens in practice) that for choices of a G M m that are far away from a* the vector 
<& _1 (a)-F(a) (recall that <I> is the m x m matrix of first order partial derivatives of F) consists of 
entries with huge magnitude. This situation results in failure of convergence of the algorithm, since 
the updating step in Newton's method takes one father away from the sought-after root. In order 
to overcome this difficulty, it is often desirable to first consider a subset of the sample, effectively 
thinning the observations. Once the data set has been thinned enough in order for the method to 



4 In most of the examples in this paper, we took e = 10 10 and S — 0.1, the latter to prevent overstepping in 
Newton's method. 
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4.0 r 



3.5 - 



2.5 - 



-10 12 

Figure 1. We generate a sample path with At = 1/2 17 , no drift, and a = 3 and 
we also create two reductions of this sample path with At = 1/2 16 and At = 1/2 13 . 
Along with the true a, the resulting <r* are shown when A = 20(At) 2 / 3 and m = 1. 



converge, one can use the root derived from the thinned data set as initial seed for a denser subset 
of the data. Continuing this way, one finally computes the estimator of a using the whole sample. 
Note that we use this method in order to obtain the estimators in the following two sections. 



3. Simulation Study and Empirical Rates of Convergence 

In this section, we shall conduct an empirical investigation of the rate at which our estimators 
converge to the true volatility function. 

As a simple benchmark case when m = 1, we consider the case of a driftless Brownian motion 
with cr = 3. We generate a path of the Brownian motion using exact simulation, and produce data 
points with a step size of At = 2 -17 over the unit interval. Continuing, we remove every other 
realization of the sample path to arrive at a "reduction" of the original sample path with step sizes 
of At = 2~ 16 . We then remove every other observation three more times to create yet another 
reduction with step size At = 2~ 13 . The resulting estimates cr* of the true volatility function 
a = 3 are shown in Figure [l] using A = 20(Ai) 2 / 3 . (This choice is consistent with the discussion 



in subsection 2.5.) Notice that convergence to the constant volatility function a = 3 seems to be 
quite fast. 

To study the empirical rates of convergence more precisely, and to illustrate that our method 
works well even for diffusions with non-zero drift, we present a more involved example. Consider 
the diffusion Y with Y$ = 1/2 and dynamics 

(3.1) dY t = -Y t 2 (l-Y t )dt + Y t (l-Y t )dW t , te[0,l}. 

The diffusion Y is (0, l)-valued; in fact, a straightforward use of Ito's formula shows that 

(3 ' 2) y *-l + exp(m-V2)' ' €[0 ' 1] - 
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(A) 




FIGURE 2. We generate a sample path with At = 1/2 25 from the diffusion whose 



dynamics are given by equation (3.1). Then, we create 15 reductions of this sample 
path with At = 1/2 10 , 1/2 25 . (A) In the case m = 1, we plot log 2 RMSE^ 1 '^ 



,0" 



against q for q = 10, 25. The least-squares line is —1.383 — 0.343g, which is consis- 
tent with results from kernel-based estimation schemes which suggest that the slope 
should be —1/3 « —0.333. (B) In the case m = 2, we again plot log 2 KMSE(ai 2 ' 9 \ a) 
against q for q = 10, 25. The least-squares line is —1.024 — 0.398g, which is con- 
sistent with our conjecture which suggests that the slope should be —2/5 = —0.4. 



Note that Y does not have zero drift and that o~{y) = y{l — y). Furthermore, equation (3.2) gives 
a way to simulate Y without any discretization error, since one only needs to simulate W, which 
can be done exactly. We simulate a sample path with a step size 2 -25 and T = 1, and also create 
reduced versions of this sample path with step sizes At of 2~ 10 , 2 , . . . , 2 -25 . For the case m = 1, 
we use A = 30(At) 2//3 , while for the case m = 2 we use A = 20(At) 4 / 5 . For both m = 1 and m = 2 
and q = 10, . . . , 25, we produce penalized maximum quasi-likelihood estimators a 
compute the root mean-integrated squared error 



(m,q) 



We then 



RMlSE(ai m ' q \a 



\ 



T 



n 



where n = 2 q T is the sample size, as a proxy to the quantity 




Ah 



raj* 



a 



(m,q) 



(y) 



1 L^(y)dy, 



where Lj,(y) is the semimartingale local time of Y at level y £ R accumulated up to time T. We 
then plot log 2 (RMISE(cri m ' 9 ' ) , a)) against q and execute least-squares linear fits to determine the 
rate of convergence as we "fill in" the sample path — see Figure [2j The regression line in the case 
m = 1 was —1.383 — 0.343g. In the case m = 2, the regression line is —1.024 — 0.398*7. I R both 
cases, the slope of the line is roughly consistent with convergence results for kernel-based estimation 
schemes, which suggests that the slope should be —m/{2m + 1), giving —1/3 for m = 1 and —2/5 
for m = 2. We, therefore, conjecture (as already mentioned) that the rate of convergence is of order 
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n -m/(2m+i) when one chooses m as penalization differentiation order and uses \ n ~ n -2m/(2m+i) 
as n — > oo. Note that this conjectured rate is optimal, as demonstrated in [15] . 

4. Application to Exchange Rates and Interest Rates 

4.1. Exchange Rates. The three exchange rates were taken from the Federal Reserve Economic 
Data (FRED) database, which is maintained by the St. Louis branch of the Federal Reserve Bank. 
(The FRED database can be accessed at |http : //research . stlouisf ed . org/f red2} Registering 
for a username and password is required as of May 31, 2010.) For purposes of comparability, we 
study all three exchange rates from January 4, 1999 through May 21, 2010 (the Euro debuted on 
at the beginning of 1999). The FRED database contains noon buying rates in New York City 
for cable transfers payable in foreign currencies. It includes data only for days on which financial 
markets are open. 

Though exchange rate data at weekly and monthly reporting frequencies is also available through 
the FRED database, our nonparametric estimation technique performs best when the frequency 
of the data is relatively high. However, we did not attempt to use intra-day exchange rate data, 
even though such data is increasingly available for free on the Internet. (See, for example, the 
weblink http://www.forexrate.co.uk/forexhistoricaldata.php.) Though bid-ask spreads on 
exchange rates are generally quite low, much of the apparent volatility observed in financial asset 
prices observed with sufficiently high frequency (for example, one-minute time intervals) is due to 
buyers buying at the ask price and sellers selling at the bid price. This "toggling" between bid and 
ask prices during very high frequency trading biases volatility estimation upwards, rendering rules 
like the "square root of time multiplied by the volatility" ineffectual — see |22j . 



We first treat the data by computing the Rt i mentioned in equation (2.2). In particular, we 
are careful to compute yjti — U-i precisely and in calendar time, rather than by assuming that a 
calendar year has approximately 250 trading days and then dividing Yt i — Yt i _ 1 by y^l/250. This 
adjustment is small but potentially important, particularly when using high frequency data. For 
example, the standard deviation of the returns on the USD/EUR exchange rate over weekdays 
(or holidays) is 0.6457733%. Over weekends, the standard deviation is 0.675802%. With 624 
weekend/holiday trading days and 2866 weekday returns, the variance ratio test yields a test 
statistic of 1.0953 and a rejection of the null hypothesis of equality of variances at the 10% (though 
not the 5%) level of significance. In general, we will appropriately adjust for the length of the 
underlying time intervals, since there appears to be a slight but meaningful tendency for information 
to be released over the weekend that causes Friday-to-Monday returns to have somewhat higher 
dispersion than regular weekday returns. 

We also pre-process the data in one additional way to make it suitable for the numerical scheme 
articulated in Section [2| In this section, we required that sorted values of the asset prices (denoted 
Uj) be distinct from one another. In almost all diffusion models, the theoretical probability that two 
values Yt- and are equal is zero when i ^ j. In practical situations in financial markets, in which 
asset prices are only recorded to a finite number of decimal places, ties are possible and sorting the 
raw asset price data becomes an ill-defined task. To manage this problem, we tentatively compute 
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Figure 3. Under the assumption that the USD/EUR, USD/GBP, JPY/USD, and 
EUR/USD exchange rates for the period from January 4, 1999 to May 21, 2010 
are governed by diffusion processes, these figures show estimates <7* of the diffusion 
function a. 



the values of take their mean, and add to the raw asset price data a Gaussian random number 
with mean zero and variance equal to a very small constant times the mean of the R^. . We then 
recompute |-RtJ. By slightly perturbing the raw data, all ties are randomly broken and, in practice, 
the movement in the time series is very negligible. (One could also perform the perturbation only 
to the tied data — in practice, the two approaches lead to almost identical output.) 

In Figure 3a, we plot the estimates of the USD/EUR exchange rate volatility for two different 
values of A for the period from January 4, 1999 to May 21, 2010. Consistent with the discussion in 
Section [3J we choose A to be 0.02n~ 4//5 and 0.05n _4//5 , where n is equal to the number of returns in 
our time series — 2866 days, to be precise. The two estimates of the volatility in Figure 3a 



seem 



to indicate that using a linear function for a in equation (1.1) is not appropriate. Instead, the 
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volatility of the USD /EUR exchange rate seems to increase at a fairly rapid pace around values of 
1.30 USD/EUR. 

In Figure |3b[ we similarly show the estimates <7* for the USD/GDP exchange rate. The 
USD/GDP exchange rate volatility varies between 16.5% and 19% over [yi,y n ] from equation 



1.2 Note that the while the estimated diffusion function for the USD/EUR exchange rate appears 
to be monotonically increasing, the estimated USD /GBP exchange rate is not. Generally, diffu- 
sion models of financial phenomena assume that the diffusion function is either constant (like the 
Vasicek model) or that the diffusion function is a constant multiplied by some increasing function 
(geometric Brownian motion or the Cox-Ingersoll-Ross process) . We note that it would be interest- 
ing, and potentially the focus of future work, to use our estimated diffusion functions in hypothesis 
testing — for example, in this setting one could test whether the assumption that the USD/GDP 
exchange rate volatility is constant can be rejected at a high level of statistical significance. 



Finally, in Figure 3c and 3d, we show the estimates a* for the JPY/USD and EUR/USD exchange 
rates, respectively. Note that the estimated diffusion function for the JPY/USD exchange rate is 
also not monotonic and that it peaks when the exchange rate is between 95-100 Japanese yen per 
U.S. dollar. The estimated diffusion function associated to the EUR/USD exchange rate looks 



fairly similar to the function in Figure 3a 



4.2. LIBOR Rates, Treasury Bill Yields, and Treasury Bond Yields. We obtain constant- 
maturity 1-month U.S. Treasury bill yields, constant-maturity 3-month U.S. Treasury bill yields, 
and constant-maturity 30-year Treasury bond yields from the Federal Reserve Economic Data 
(FRED) database as well. (The time series designations in the FRED database for the constant- 
maturity 1-month U.S. Treasury bill yields, the constant-maturity 3- month U.S. Treasury bill 
yields, and 30-year U.S. Treasury bond yields are DGS1M0, DGS3MD, and GS30, respectively.) Again, 
the yields are recorded each day at 12:00 noon Eastern Standard Time. The FRED database 
contains yields recorded at the end of every week and the end of every month, but we chose to 
work with yields recorded at the end of each day on which financial markets were open. Addi- 



tionally, we computed the Rt i _ 1 in equation (2.2) correctly, fully accounting for long weekends and 



market holidays. We slightly randomly perturbed the raw data to break ties, as discussed in the 
previous section. Unlike the Treasury bill and bond yield data, we obtain overnight London Inter- 
bank Offered Rates (LIBOR) from Thompson Reuters Datastream. (For more information about 
Datastream, see http://thomsonreuters.com/products.) The LIBOR rate is the rate that large 
banks use to borrow and lend from one another on the overnight market. 

The periods of time for which the estimates a* were derived were: January 2, 1999 through May 
26, 2010 for LIBOR, July 31, 2001 through May 31, 2010 for 1-month U.S. Treasury bill yields, 
January 4, 1982 through Mary 21, 2010 for 3-month U.S. Treasury bill yields, and January 4, 1999 
through May 21, 2010 for 30-year U.S. Treasury bond yields. 



In Figure 4a, we plot the estimates of London Interbank Offered Rate (LIBOR) volatility for 
two different values of A. Our procedure generates an estimate of the volatility function, cr*, that is 
clearly neither linear nor a constant multiple of the square root function (as in the case of the Cox- 
Ingersoll-Ross process). Indeed, the volatility function is not even monotonic. Though similarly 
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Figure 4. Under the assumption that overnight LIBOR, 1-month U.S. Treasury 
bill yields, 3-month U.S. Treasury bill yields, and 30-year U.S. Treasury bond yields 
are governed by diffusion processes, these figures show estimates a* of the diffusion 
function a. The periods of time for which the estimates a* were derived are, respec- 
tively, January 2, 1999 through May 26, 2010 for LIBOR; July 31, 2001 through 
May 31, 2010 for 1-month U.S. Treasury bill yields; January 4, 1982 through Mary 
21, 2010 for 3-month U.S. Treasury bill yields; and January 4, 1999 through May 
21, 2010 for 30-year U.S. Treasury bond yields. 



non-monotonic, the estimate a* for one- month U.S. Treasury bill yields is low when yields are both 
relatively low and relatively high. The volatility of constant-maturity one-month U.S. Treasury 
bill yields seems to be highest when yields are around 3.75%. The estimates of a for 3- month 
U.S. Treasury bill yields and 30-year Treasury bond yields are remarkably different. The volatility 
function associated with 3-month U.S. Treasury bill yields appears to be a monotonic increasing 
function of the underlying yield. In contrast, the volatility function associated with 30-year U.S. 
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Treasury bond yields is monotonically decreasing function of the underlying yield yield. No function 
appears to adhere closely to any of the standard choices of a in the financial literature. 

References 

[1] U. Asher AND L. Petzold, Computer Methods for Ordinary Differential Equations and Differential Algebraic 

Equations, SIAM, Philadelphia, 1988. 
[2] F. Bandi and P. Phillips, Fully nonparametric estimation of scalar diffusion models, Econometrica, 71:1 

(2003), pp. 241-283. 

[3] G. Banon, Nonparametric identificaton for diffusion processes, SIAM J. Control and Optimization, 16:3 (1978), 
pp. 380-395. 

[4] F. Comte, V. Genon-Catalot, and Y. Rozenholc, Penalized nonparametric mean square estimation of the 

coefficients of diffusion processes, Bernoulli, 13:2 (2007), pp. 514-543. 
[5] D. Cox, Asymptotics for m-type smoothing splines, Ann. Stat., 11:2 (1983), pp. 530-551. 

[6] D. Cox AND F. O'Sullivan, Asymptotic analysis of penalized likelihood and related estimators, Ann. Statist., 
18:4 (1990), pp. 1676-1695. 

[7] P. Eggermont and V. LaRiccia, Maximum Penalized Likelihood Estimation, vol. I of Springer Series in 

Statistics, Springer- Verlag, New York, 2001. 
[8] , Maximum Penalized Likelihood Estimation, vol. II of Springer Series in Statistics, Springer- Verlag, New 

York, 2009. 

[9] R. Eubank, Spline Smoothing and Nonparametric Regression, vol. 90 of Statistics: Textbooks and Monographs, 
Marcel Dekker, Inc., New York, 1988. 

[10] J. Fan, Y. Fan, and J. Lv, Aggregation of nonparametric estimators for volatility matrix, J. Financial Econo- 
metrics, 5 (2007), pp. 321-357. 

[11] J. Fan and Q. Yao, Nonlinear Time Series: Nonparametric and Parametric Methods, Springer Series in 
Statistics, Springer, New York, 2005. 

[12] D. Florens-Zmirou, On estimating the diffusion coefficient from discrete observations, J. Appl. Probab., 30:4 
(1993), pp. 790-804. 

[13] V. Genon-Catalot, C. Laredo, and D. Picard, Nonparametric estimation of the diffusion coefficient by 
wavelet methods, Scand. J. Statist., 19 (1992), pp. 319-335. 

[14] E. Gobet, M. Hoffmann, and M. Reiss, Nonparametric estimation of scalar diffusions based on low frequency 
data, Ann. Statist., 32:5 (2004), pp. 2223-2253. 

[15] P. Green and B. Silverman, Nonparametric Regression and Generalized Linear Models, vol. 58 of Monographs 
on Statistics and Applied Probability, Chapman & Hall, London, 1994. 

[16] J. Hamrick and M. Taqqu, Testing diffusion processes for non-stationarity, Mathematical Methods of Oper- 
ations Research, 69:3 (2009), pp. 509-551. 

[17] M. Hoffmann, Adaptive estimation in diffusion processes, Stoch. Process. Appl., 79:1 (1999), pp. 135-163. 

[18] , Adaptive estimation in diffusion processes, Stochastic Process. Appl., 79:1 (1999), pp. 135-163. 

[19] , l p estimation of the diffusion coefficient, Bernoulli, 5:3 (1999), pp. 447-481. 

[20] J. Jacod, Non-parametric kernel estimation of the coefficient of a diffusion, Scand. J. Statist., 27:1 (2000), 
pp. 83-96. 

[21] G. Jiang and J. Knight, A nonparametric approach to the estimation of diffusion processes, with an application 
to a short-term interest rate model, Econometric Theory, 13:5 (1997), pp. 615-645. 

[22] P. Jorion, Financial Risk Manager Handbook, John Wiley & Sons, Hoboken, New Jersey, 2009. 

[23] I. Karatzas and S. Shreve, Brownian Motion and Stochastic Calculus, Springer- Verlag, New York, 1991. 

[24] Y. Kutoyants, Statistical Inference for Ergodic Diffusion Processes, Springer Series in Statistics, Springer- 
Verlag, London, 2004. 



MAXIMUM PENALIZED QUASI-LIKELIHOOD ESTIMATION OF THE DIFFUSION FUNCTION 17 

[25] P. Soulier, Nonparametric estimation of the diffusion coefficient of a diffusion process, Stochastic Anal. Appl., 
16 (1998), pp. 185-200. 

[26] F. Utreras, On computing robust splines and applications, SIAM J. Sci. Statist. Comput., 2 (1981), pp. 153- 
163. 

( Jeff Hamrick) Department of Mathematics and Computer Science, Rhodes College, 2000 N. Parkway, 
Memphis, TN 38112, USA 

E-mail address: hamrickj@rhodes.edu 

(Yifei Huang) Department of Mathematics and Statistics, Boston University, 111 Cummington Street, 
Boston, MA 02215, USA 

E-mail address: yifei@bu.edu 

(Constantinos Kardaras) Department of Mathematics and Statistics, Boston University, 111 Cumming- 
ton Street, Boston, MA 02215, USA 
E-mail address: kardaras@bu.edu 

(Murad S. Taqqu) Department of Mathematics and Statistics, Boston University, 111 Cummington 
Street, Boston, MA 02215, USA 
E-mail address: murad@bu.edu 



